Let’s practice bringing up a very simple little map of the US using the
maps and ggplot2 packages. The maps
package provides latitude and longitude data for various geographic regions.
The vignette for the package can be found here if
you’d like more information on the package. Although we will do something
much more complicated for our Project 2, for this little exploration
let’s just keep it simple by taking advantage of everything that is
pre-loaded in maps.
First, we need to extract state-level geographic information using
the map_data() function. Similar functions exist for other
geographic levels, including world (country), county, and other
regions of the world, but we will be using state.
states <- map_data("state")
| long | lat | group | order | region | subregion |
|---|---|---|---|---|---|
| -87.46201 | 30.38968 | 1 | 1 | alabama | NA |
| -87.48493 | 30.37249 | 1 | 2 | alabama | NA |
| -87.52503 | 30.37249 | 1 | 3 | alabama | NA |
| -87.53076 | 30.33239 | 1 | 4 | alabama | NA |
| -87.57087 | 30.32665 | 1 | 5 | alabama | NA |
| -87.58806 | 30.32665 | 1 | 6 | alabama | NA |
The states variables are:

- long = longitude in unprojected geographic coordinates (i.e., longitude and latitude on a sphere or ellipsoid like WGS84); ranges from -180 to 180
- lat = latitude in unprojected geographic coordinates; ranges from -90 to 90
- group = grouping variable for polygon boundaries, where all points in a given group represent a single polygon, like parts of a country, state, or county. E.g., here Alabama is treated as a single group (labeled as 1 because it’s the first state alphabetically)
- order = indicates the order in which points should be connected to draw the polygon; helps when connecting long and lat coordinates to ensure the shape is drawn in the correct sequence
- region = represents the primary geographic unit requested; e.g., here we requested “state”, so region takes the state names as labels
- subregion = refers to a subdivision within a region, but is only present for detailed maps like “county”, not “state”; hence we have no subregion here
Now, let’s make a base map leveraging the ggplot2
package using the states dataset. Notice that
x will always be longitude, y will always be
latitude, and that we must also pass it a grouping variable. Why do you
think we need the group variable here?
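The base map itself can be sketched along these lines (a minimal example, assuming states was created with map_data("state") as above; coord_fixed(1.3) is a common rough aspect-ratio correction for unprojected coordinates):

```r
library(ggplot2)
library(maps)

states <- map_data("state")

# Base map: one polygon per group, with vertices drawn in `order`
baseMap <- ggplot(data = states,
                  aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", color = "black") +
  coord_fixed(1.3) +   # rough aspect-ratio correction for lat/long
  theme_void()

baseMap
```

Without the group aesthetic, ggplot2 would connect the vertices of all states into a single tangled polygon, which is a good hint for the question above.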
Great, we’ve made our first map of the continental US! It’s pretty uninteresting just as-is, so let’s use this map to overlay our pneumonia-related hospital readmissions data.
## Now make a new column named region
pneumoniaFull <- pneumoniaFull %>%
mutate(region = getStateName(State)) ## Naming it 'region' to make the name in the states dataframe
PredictedReadmissionRate by state

Now, before we add the pneumonia-related hospital readmissions to the
map, we need to aggregate them at the state level. We
will do that by aggregating across states and taking the
mean() and median().
Start from my code and finish the aggregation by state for the
mean() and median() by filling in the blanks.
Notice that we are making a separate dataset called
stateAggPneumonia. Make sure to uncomment the code when
you’re ready to run it.
# stateAggPneumonia <- ___ %>%
# ## Grab readmissions and region
# select(PredictedReadmissionRate, ___,
# `Score_Death rate for pneumonia patients`,
# `Score_Hospital return days for pneumonia patients`,
# `Score_Medicare spending per patient`) %>%
# ## Change region to a factor
# mutate(region = as.factor(___)) %>%
# ## Change any other columns from character to numeric
# mutate_if(is.character, ___) %>%
# ## Group by region
# group_by(___) %>%
# ## Create the new aggregated variables for each measure
# summarize(meanReadmissions = mean(PredictedReadmissionRate, na.rm = TRUE),
# medianReadmissions = ___(PredictedReadmissionRate, na.rm = TRUE),
# meanDeathRate = mean(`Score_Death rate for pneumonia patients`,
# na.rm = ___),
# medianDeathRate = ___(`Score_Death rate for pneumonia patients`,
# na.rm = TRUE),
# meanHospitalReturnDays = ___,
# medianHospitalReturnDays = ___,
# meanMedicareSpending = ___,
# medianMedicareSpending = ___) %>%
# ## Remove the missing region
# filter(!is.na(region))
stateAggPneumonia <- pneumoniaFull %>%
  ## Grab readmissions and region
  select(PredictedReadmissionRate, region,
         `Score_Death rate for pneumonia patients`,
         `Score_Hospital return days for pneumonia patients`,
         `Score_Medicare spending per patient`) %>%
  ## Change region to a factor
  mutate(region = as.factor(region)) %>%
  ## Change any other columns from character to numeric
  mutate_if(is.character, as.numeric) %>%
  ## Group by region
  group_by(region) %>%
summarize(meanReadmissions = mean(PredictedReadmissionRate, na.rm = TRUE),
medianReadmissions = median(PredictedReadmissionRate, na.rm = TRUE),
meanDeathRate = mean(`Score_Death rate for pneumonia patients`, na.rm = TRUE),
medianDeathRate = median(`Score_Death rate for pneumonia patients`, na.rm = TRUE),
meanHospitalReturnDays = mean(`Score_Hospital return days for pneumonia patients`, na.rm = TRUE),
medianHospitalReturnDays = median(`Score_Hospital return days for pneumonia patients`, na.rm = TRUE),
meanMedicareSpending = mean(`Score_Medicare spending per patient`, na.rm = TRUE),
medianMedicareSpending = median(`Score_Medicare spending per patient`, na.rm = TRUE)) %>%
filter(!is.na(region))
save(stateAggPneumonia, file = "stateAggPneumonia.Rdata")
load(file = "stateAggPneumonia.Rdata")
Are there any missing data? If so, what data are missing? Did you make a mistake?
Your answer here.
Choosing the mean or median value here is going to be really important for anyone interpreting our maps, as well as the spatial statistics we will calculate downstream. Recall that we can choose the mean when:

- The data are symmetric (i.e., approximately normal, with that beautiful bell-shaped curve!)
- There are no extreme outliers or heavy skew
- You are working with data that were measured on a truly continuous scale (e.g., height, weight) OR can be approximated by a normal distribution because of a sufficiently large sample size (see the Central Limit Theorem)
- You want to describe an “average”
- You are using statistical methods that assume normality (i.e., parametric tests)
On the other hand, we will choose the median when:

- The data are skewed (i.e., there is a long tail on one side of the curve) OR the data are lepto- or platykurtic (overly “peaked” or flattened, respectively)
  - Skewness guidelines:
    - Skewness > 0 \(\rightarrow\) right-skewed \(\rightarrow\) prefer median
    - Skewness \(\approx\) 0 \(\rightarrow\) symmetric \(\rightarrow\) prefer mean
    - Skewness < 0 \(\rightarrow\) left-skewed \(\rightarrow\) prefer median
  - Kurtosis guidelines:
    - Kurtosis > 3 \(\rightarrow\) leptokurtic (sharp peak, heavy tails) \(\rightarrow\) prefer median
    - Kurtosis \(\approx\) 3 \(\rightarrow\) mesokurtic (bell-shaped) \(\rightarrow\) prefer mean
    - Kurtosis < 3 \(\rightarrow\) platykurtic (flat peak, light tails) \(\rightarrow\) prefer median
- There are outliers that could distort the mean
- You want to describe the “typical value” or “middle case”
- You have ordinal data (e.g., patient satisfaction ratings)
- You are using non-parametric statistical methods that do not assume normality because you found that your data were not normally distributed
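As a quick illustration of the outlier point (a toy example, not our pneumonia data), a single extreme value drags the mean upward while barely moving the median:

```r
# Toy example: ten typical readmission-like values plus one extreme outlier
x <- c(14, 15, 15, 16, 16, 16, 17, 17, 18, 60)

mean(x)    # 20.4 - pulled upward by the single outlier
median(x)  # 16   - still reflects the "typical" value
```

Reporting 20.4 as the “average” here would misrepresent nearly every observation in the data.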
You may have to do some research if you do not recall, but why would showing someone a mean when we should have showed them a median, or vice versa, potentially create interpretation problems for us or our stakeholders?
Your answer here.
Practice reading boxplots to decide, for each measure, whether you should use the mean or median, or whether it matters. Need some help with boxplots? This article breaks it down nicely.
Based on Figure 1 below, which measure(s) do you think would benefit from the median over the mean? Why?
Your answer here.
Figure 1. Violin and boxplots to assess distribution symmetry
In Project 1, I introduced the Shapiro-Wilk test for normality for those who were not familiar with it. As a reminder, the Shapiro-Wilk test is a goodness-of-fit test that assesses whether a normal distribution matches our data, with the null hypothesis that it does. It does this by computing a \(W\) statistic that compares the ordered sample values to the corresponding normal distribution quantiles. As I mentioned before, it typically has high power to detect deviations from normality in most situations unless \(N>5000\).
But normality tests (and Shapiro-Wilk is not the only one!) are not a sufficient test of skewness or kurtosis. E.g., we can have enough skew or kurtosis that our distribution would be better described by a median than a mean, yet it does not deviate at the 5% significance threshold from a normal approximation! In other words, the data can look normal, not deviate from normal per a Shapiro-Wilk test, and yet still be sufficiently skewed that the median is the better choice. One reason this can happen, for example, is if you have a lot of repeats of the same value, as seen in count data or quantitative measures that have been rounded to integers.
Enter D’Agostino’s \(K^2\) or Skewness Test. Also referred to as the “D’Agostino-Pearson Test” in some statistics sources, the D’Agostino test was also designed as a goodness-of-fit test for distribution departures from normality. It simply doesn’t perform quite as well overall as Shapiro-Wilk for tests of deviation from normality, BUT it is superior for assessing differences in skewness or kurtosis! The test computes two components, (1) skewness to check for asymmetry and (2) kurtosis to check for peakedness or flatness, and combines these components into a \(K^2\) test statistic, a version of the \(\chi^2\) statistic.
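To make the two components concrete, here is a base-R sketch of the plain method-of-moments sample skewness and kurtosis that these tests build on (the actual \(K^2\) test applies bias corrections and normalizing transformations before combining them, so this is illustrative only):

```r
# Method-of-moments sample skewness and kurtosis (no bias correction)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / (mean((x - m)^2))^2
}

set.seed(42)
x <- rexp(1000)   # right-skewed example data

skewness(x)       # clearly positive -> right skew
kurtosis(x)       # well above 3 -> heavy tails
```

For a normal distribution these land near 0 and 3, which is exactly the baseline the guidelines above compare against.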
Let’s use the agostino.test() from the
moments package to apply this to our data. For our first
example, I will use PredictedReadmissionRate. Remember, the
null hypothesis, \(H_0\), is that there
is no skew. So, \(p <
0.05\) means we reject the null hypothesis of
symmetry and conclude that the data are significantly skewed.
However, if \(p \ge 0.05\), then there
is no significant skewness in the data.
agostino.test(pneumoniaFull$PredictedReadmissionRate)
##
## D'Agostino skewness test
##
## data: pneumoniaFull$PredictedReadmissionRate
## skew = 0.48045, z = 9.75681, p-value < 0.00000000000000022
## alternative hypothesis: data have a skewness
Interpretation: Because \(p<0.05\), we reject the hypothesis of symmetry and conclude that predicted readmissions are significantly skewed. We should use the median to aggregate this variable.
Your task is to repeat this test, with interpretations, for
Score_Death rate for pneumonia patients,
Score_Hospital return days for pneumonia patients, and
Score_Medicare spending per patient:
# Your code here.
Interpretation: …
# Your code here.
Interpretation: …
# Your code here.
Interpretation: …
One final thing we can do is assess the stability of the estimates of central tendency we’ve calculated, the mean and the median, to ensure that, post-aggregation, we don’t see any issues in the resulting distribution. If we did, we would want to consider another estimate we haven’t yet explored - e.g., another percentile or even a weighted mean.
Based on Figure 2 below, and using your choices from questions 8A and 8B, assess the stability (i.e., symmetry) of the estimates you selected as the best (either mean or median) for each of the four variables. Do you see any cause for concern with your choice, or do you feel comfortable to proceed with what you’ve selected?
Figure 2. Compare stability of mean vs. median after aggregation
The time has come to make your final selection! Your job is to
remove the columns from stateAggPneumonia
that do not correspond with your final decisions of mean or median for
questions 8A-C.
# Your code here.
Merge stateAggPneumonia with the
states data for mapping

Let’s use a left_join of stateAggPneumonia
onto the states dataframe we made in Section 1. Make sure to join by the shared
geographic identifier; here that is region.
mergedStateAggPneumonia <- left_join(stateAggPneumonia, states, by = "region")
Recall from lecture that a choropleth map is a type of thematic map where areas (such as states, counties, or other geographic regions) are shaded or colored in relative proportion to the quantitative value of a variable.
What follows is code to perform choropleth mapping in
ggplot2, building on our base map from Section 1. I will show you how to do the first one
for median Predicted Readmission Rate. You will be
mapping the subsequent variables based on your choices of mean or
median!
NOTE: At this stage, you will get an error if you
did NOT choose to retain medianReadmissions! If you made
that mistake, return to this step!
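For reference, the chunk that makes p1 was along these lines (a sketch reconstructed from context; it assumes mergedStateAggPneumonia from the join above and the skittles color vector defined earlier in the document):

```r
# Choropleth of median predicted readmissions, built on the Section 1 base map
p1 <- ggplot(data = mergedStateAggPneumonia,
             aes(x = long, y = lat, group = group,
                 fill = medianReadmissions)) +
  geom_polygon(color = "black") +
  scale_fill_gradient(low = "white", high = skittles[1],
                      name = "Median Predicted\nReadmission Rate") +
  coord_fixed(1.3) +
  theme_void() +
  labs(title = "Median Pneumonia Readmissions by State")

p1
```

The fill aesthetic is the only real change from the base map: each state’s polygon is now shaded by its aggregated value.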
Do you notice any trends that could be worth testing? If you’re not sure, take a look at this sorted table showing the top 10 states for median pneumonia-related hospital readmissions:
| Region | Median Predicted Pneumonia Readmissions |
|---|---|
| district of columbia | 18.79 |
| nevada | 18.10 |
| new jersey | 17.91 |
| west virginia | 17.71 |
| michigan | 17.46 |
| illinois | 17.40 |
| california | 17.35 |
| florida | 16.99 |
| kentucky | 16.98 |
| new york | 16.88 |
HINT: It may not be readily obvious from the map or table! If you don’t see a hypothesis, that’s okay. But try to think about anything some of these states might have in common, if anything.
Your task is to now update the code I used to make p1
for the remaining three variables. I suggest copying the code and
following the comments to update everywhere I have !!Update
written. You should update per the following specification for
full credit:
- Pneumonia Death Rate (whether mean or median, your choice from earlier) \(\rightarrow\) color should be skittles[2]
- Medicare Spending (whether mean or median, your choice from earlier) \(\rightarrow\) color should be skittles[6]
- Pneumonia Hospital Return Days (whether mean or median, your choice from earlier) \(\rightarrow\) color should be skittles[3]
## Load the training and testing sets
load("../../I. Biomedical & Clinical Informatics/Project_1/FY2024_data_files/readyTrain.Rdata")
load("../../I. Biomedical & Clinical Informatics/Project_1/FY2024_data_files/readyTest.Rdata")
## swap out state with state names
readyTrain <- readyTrain %>%
select(-State) %>%
mutate(FacilityId = rownames(readyTrain))
## swap out state with state names
readyTest <- readyTest %>%
select(-State) %>%
mutate(FacilityId = rownames(readyTest))
## now merge with pneumoniaFull to get the state names
stateTrain <- readyTrain %>%
left_join((pneumoniaFull %>%
select(FacilityId,
region,
PredictedReadmissionRate,
`Score_Medicare spending per patient`,
`Score_Hospital return days for pneumonia patients`,
`Score_Death rate for pneumonia patients`)), by = "FacilityId") %>%
column_to_rownames("FacilityId") %>%
rename(Median_RawPredictedReadmissionRate = PredictedReadmissionRate,
Median_RawMedicareSpending = `Score_Medicare spending per patient`,
Median_RawDeathRate = `Score_Death rate for pneumonia patients`,
Median_RawHospitalReturnDays = `Score_Hospital return days for pneumonia patients`)
## now merge with pneumoniaFull to get the state names
stateTest <- readyTest %>%
left_join((pneumoniaFull %>%
select(FacilityId,
region,
PredictedReadmissionRate,
`Score_Medicare spending per patient`,
`Score_Hospital return days for pneumonia patients`,
`Score_Death rate for pneumonia patients`)), by = "FacilityId") %>%
column_to_rownames("FacilityId") %>%
rename(Median_RawPredictedReadmissionRate = PredictedReadmissionRate,
Median_RawMedicareSpending = `Score_Medicare spending per patient`,
Median_RawDeathRate = `Score_Death rate for pneumonia patients`,
Median_RawHospitalReturnDays = `Score_Hospital return days for pneumonia patients`)
## Now aggregate again - medians all around!
stateAggTrain <- stateTrain %>%
group_by(region) %>%
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE)))
## Now aggregate again - medians all around!
stateAggTest <- stateTest %>%
group_by(region) %>%
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE)))
To perform spatial analysis, one needs geographic-level aggregate data that has been joined with geometry. Although it might feel like we technically have everything we need, alas, we do not - quite.
Great! But because our ultimate goal is to tie this back to all of the work we did in Project 1, I will actually give you some data - which you get by running this code chunk - containing both the training and testing sets, aggregated but not yet merged with geometries, so we don’t lose our other encoding and especially our transformations for OLS. Check that all of the states made it into both the test and training sets.
Identify your geographic level of aggregation for each variable. [ Here, we have selected state ]
Get the right geometries for spatial analysis. Sadly, what we used for mapping will not be sufficient. Recall that, in lecture, I kept referring to something called shapefiles.
Calculate spatial weights.
Calculate the Global Moran’s \(I\), a measure of global spatial autocorrelation.
Calculate the Local Moran’s \(I\), a measure of local spatial autocorrelation.
Calculate the Spatial Lag to perform a spatial lag regression (SAR).
If you’re working with spatial autocorrelation, the sf and spdep packages are the right tools. If you’re starting from maps, convert the map objects to sf for easier integration.
## Get map geometries
states_sf <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
counties_sf <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
It is at this stage that you would add FIPS codes or state names for merging.
Use spdep::poly2nb() and spdep::nb2listw() to create spatial weights.
# The sf package now uses Google's S2 geometry engine by default, which is strict and throws errors when geometries are even slightly invalid (like crossing edges). You can turn it off just for this step, which often resolves the issue:
# Step 1: Disable S2
sf_use_s2(FALSE)
# Step 2: Validate the geometries (ours came from maps above)
states_sf <- st_make_valid(states_sf) # just to be safe
# Step 3: Create neighbors
state_nb <- poly2nb(states_sf)
## State spatial weights
state_sw <- nb2listw(state_nb, style = "W")
# Step 4: Optional — turn S2 back on
sf_use_s2(TRUE)
unique(states_sf$ID)
## [1] "alabama" "arizona" "arkansas"
## [4] "california" "colorado" "connecticut"
## [7] "delaware" "district of columbia" "florida"
## [10] "georgia" "idaho" "illinois"
## [13] "indiana" "iowa" "kansas"
## [16] "kentucky" "louisiana" "maine"
## [19] "maryland" "massachusetts" "michigan"
## [22] "minnesota" "mississippi" "missouri"
## [25] "montana" "nebraska" "nevada"
## [28] "new hampshire" "new jersey" "new mexico"
## [31] "new york" "north carolina" "north dakota"
## [34] "ohio" "oklahoma" "oregon"
## [37] "pennsylvania" "rhode island" "south carolina"
## [40] "south dakota" "tennessee" "texas"
## [43] "utah" "vermont" "virginia"
## [46] "washington" "west virginia" "wisconsin"
## [49] "wyoming"
#states_sf <- left_join(states_sf, stateAggPneumonia, by = c("ID" = "region"))
states_sf_AggTrain <- left_join(states_sf, stateAggTrain, by = c("ID" = "region"))
Set the target:
y <- "Median_RawPredictedReadmissionRate"
# Moran's I - State
moran_state <- moran.test(states_sf_AggTrain[[y]],
state_sw)
print(moran_state)
##
## Moran I test under randomisation
##
## data: states_sf_AggTrain[[y]]
## weights: state_sw
##
## Moran I statistic standard deviate = 2.9895, p-value = 0.001397
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.268531084 -0.020833333 0.009368833
moran.plot(states_sf_AggTrain[[y]],
listw = state_sw,
ylab = "Spatially-lagged Median Pneumonia Readmissions",
xlab = "Median Pneumonia Readmissions (No Lag)", pch = 16,
col = skittles[1], xlim = c(13,20),
main = "Figure 3. Global Moran's I (State-Level Aggregation)")
The key components of the output are:

- Moran’s I statistic (ranges from -1 to +1)
- Expected value under the null hypothesis
- Standard deviate (z-score) and p-value

Interpretation:

- Positive Moran’s I: clustering of similar values (high with high, low with low)
- Negative Moran’s I: dispersion (high values near low values)
- p-value < 0.05 suggests significant spatial autocorrelation
Global Moran’s I tells you whether clustering exists overall.
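To see what the statistic is actually doing, here is a base-R sketch of Moran’s \(I\) on toy data with a hand-built, row-standardized weights matrix (this mirrors style = "W" above; spdep performs the same computation with sparse listw objects):

```r
# Toy example: 4 regions on a line, each adjacent to its neighbor(s)
x <- c(10, 12, 20, 22)            # attribute values
W <- rbind(c(0, 1, 0, 0),         # binary adjacency matrix
           c(1, 0, 1, 0),
           c(0, 1, 0, 1),
           c(0, 0, 1, 0))
W <- W / rowSums(W)               # row-standardize (style = "W")

z  <- x - mean(x)                 # deviations from the mean
n  <- length(x)
S0 <- sum(W)                      # total weight (equals n when row-standardized)

# Moran's I: weighted cross-product of neighboring deviations,
# scaled by the total variation
I <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)
I                                 # ~0.538: similar values cluster
```

Because the two low values sit next to each other and so do the two high values, the cross-products are mostly positive and \(I\) comes out well above its expectation of \(-1/(n-1)\).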
Nice! We’ve got a significant positive Moran’s I for medianReadmissions!
What this means: Moran’s I = 0.269 (positive): moderate positive spatial autocorrelation. States with similar median readmission rates tend to cluster geographically. p-value = 0.0014 (< 0.05): this spatial clustering is statistically significant. The test was done under the randomization assumption, meaning values are randomly distributed over locations under the null hypothesis.
🔥 Hotspots: A hotspot (high-high) is a location with a high value that is surrounded by other high values. If a state has a high readmission rate but is surrounded by low readmission rate states, it may not be a hotspot — it could even be labeled an outlier (high-low). ❄️ Cold spots: A cold spot (low-low) is a low-value state in a cluster of other low-value states. A state with a low rate that’s surrounded by high-rate states could be a low-high outlier, not a cold spot.
🧠 Key concept: LISA measures local spatial autocorrelation, which answers:
“Is this value similar to its neighbors’ values?” It is not just:
“Is this value high or low?”
local_moran <- localmoran(states_sf_AggTrain[[y]], state_sw)
states_sf_AggTrain$localI <- local_moran[,1]
states_sf_AggTrain$p_value <- local_moran[,5]
# Optional: identify high-high clusters
states_sf_AggTrain$cluster <- if_else(states_sf_AggTrain$p_value < 0.05 & states_sf_AggTrain$localI > 0,
"Hotspots", ifelse(states_sf_AggTrain$p_value < 0.05 &
states_sf_AggTrain$localI < 0,
"Coldspots", "Other"))
states_sf_AggTrain %>%
filter(cluster != "Other") %>%
as_tibble() %>%
  select(ID, cluster) %>% arrange(cluster)
## # A tibble: 10 × 2
## ID cluster
## <chr> <chr>
## 1 indiana Coldspots
## 2 colorado Hotspots
## 3 idaho Hotspots
## 4 maryland Hotspots
## 5 nebraska Hotspots
## 6 ohio Hotspots
## 7 pennsylvania Hotspots
## 8 utah Hotspots
## 9 virginia Hotspots
## 10 wyoming Hotspots
# print(local_moran)
ggplot(states_sf_AggTrain) +
geom_sf(aes(fill = localI), color = "black", size = 0.2) + # state borders in black
scale_fill_gradient2(
low = skittles[1], mid = "white", high = skittles[2], midpoint = 0,
name = "Local Moran's I (LISA)"
) +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(hjust = 0.5, face = "bold", size = 16)
) +
labs(title = "Local Moran's I (LISA) for Median Readmission")
p1
Why don’t these line up with p1? Explain why hotspots and cold spots will not necessarily correspond with high and low values on the map.
🔍 Why hotspots and cold spots do not necessarily correspond to high and low values on the map: Local Moran’s I detects spatial clustering of similar values, not just individual high or low values. That means:
If California has a moderately high readmission rate, and all surrounding states (e.g., Oregon, Nevada, Arizona) also have high rates, then LISA flags California as a hotspot. Florida might have the highest rate in the country, but if it’s surrounded by low or average states, it won’t be flagged as a hotspot.
Summary:
| Value | Neighbors | LISA Category | Explanation |
|---|---|---|---|
| High | High | Hotspot (High-High) | A state with high pneumonia readmissions is surrounded by other high states |
| Low | Low | Coldspot (Low-Low) | A state with low pneumonia readmissions is surrounded by other low states |
| High | Low | Outlier (High-Low) | A state with high pneumonia readmissions is surrounded by low rate states |
| Low | High | Outlier (Low-High) | A state with low pneumonia readmissions is surrounded by high rate states |
| High/Low | Mixed/Random | Not Significant (N.S.) | No significant difference from a spatially random pattern |
# Step 1: Run local Moran's I
localm <- localmoran(states_sf_AggTrain[[y]], state_sw)
# Step 2: Convert to data frame
local_df <- as.data.frame(localm)
colnames(local_df)[1] <- "localI"
# Step 3: Add lag values (spatial lag of variable)
states_sf_AggTrain$lag_medianReadmissions <- lag.listw(state_sw, states_sf_AggTrain[[y]])
# Step 4: Join localI and p-value
states_sf_AggTrain <- states_sf_AggTrain %>%
mutate(
localI = local_df$localI,
p_value = local_df$`Pr(z != E(Ii))`
)
# Step 5: Classify LISA cluster types
states_sf_AggTrain <- states_sf_AggTrain %>%
mutate(
lisa_cluster = case_when(
p_value > 0.05 ~ "Not significant",
!!sym(y) > mean(!!sym(y), na.rm = TRUE) &
lag_medianReadmissions > mean(!!sym(y), na.rm = TRUE) ~ "Hotspot",
!!sym(y) < mean(!!sym(y), na.rm = TRUE) &
lag_medianReadmissions < mean(!!sym(y), na.rm = TRUE) ~ "Cold spot",
!!sym(y) > mean(!!sym(y), na.rm = TRUE) &
lag_medianReadmissions < mean(!!sym(y), na.rm = TRUE) ~ "High-Low outlier",
!!sym(y) < mean(!!sym(y), na.rm = TRUE) &
lag_medianReadmissions > mean(!!sym(y), na.rm = TRUE) ~ "Low-High outlier",
TRUE ~ "Unclassified"
)
)
ggplot(states_sf_AggTrain) +
geom_sf(aes(fill = lisa_cluster), color = "white") +
scale_fill_manual(
name = "LISA Cluster Type",
values = c(
"Hotspot" = skittles[2],
"Cold spot" = skittles[1],
"High-Low outlier" = "darkorange",
"Low-High outlier" = "skyblue",
"Not significant" = "grey80"
)
) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(
title = "Local Moran’s I Cluster Map of Median Readmission Rates",
subtitle = "Hotspots and cold spots show spatial clusters of similar values"
)
LISA highlights local hot/cold spots.
You can fit OLS and Elastic Net as standard models on your training set, but they are not spatial models - they ignore spatial dependence. If your residuals are spatially autocorrelated, you need spatial regression models like:

- Spatial lag model (SAR): the outcome depends on neighboring outcomes.
- Spatial error model (SEM): the residuals are spatially correlated.
- Spatial Durbin model (SDM): includes spatial lags of both the predictors and the outcome.
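As a sketch of how the three model types map onto spatialreg calls, here is a self-contained example using the classic Columbus crime data shipped with spdep (a stand-in for our state-level data; the Durbin argument requests the SDM variant):

```r
library(spdep)       # neighbours and spatial weights
library(spatialreg)  # SAR / SEM / SDM fitting

# Columbus, OH neighborhood crime data bundled with spdep
data(oldcol, package = "spdep")
lw <- nb2listw(COL.nb, style = "W")

f <- CRIME ~ INC + HOVAL

sar <- lagsarlm(f, data = COL.OLD, listw = lw)                 # spatial lag of y
sem <- errorsarlm(f, data = COL.OLD, listw = lw)               # spatially correlated errors
sdm <- lagsarlm(f, data = COL.OLD, listw = lw, Durbin = TRUE)  # lags of y and the predictors
```

The formula and weights interface is the same across all three, so the model choice reduces to where you believe the spatial dependence lives: in the outcome, the errors, or both.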
Step-by-Step: Spatial Regression with OLS + Elastic Net + Spatial Models
Question: Why are we fitting this model again? (Hint: think about the aggregation.)
regData <- states_sf_AggTrain %>%
as_tibble() %>%
select(-bc_PredictedReadmissionRate, -Median_RawMedicareSpending,
-Median_RawHospitalReturnDays, -Median_RawDeathRate, -ID,
-localI, -p_value, -lisa_cluster, -geom,
-contains("Compared"), -lag_medianReadmissions)
ols_model <- lm(Median_RawPredictedReadmissionRate ~ ., data = regData)
#summary(ols_model)
# Check residuals for spatial autocorrelation:
# Moran's I test on residuals
moran.test(residuals(ols_model), state_sw)
##
## Moran I test under randomisation
##
## data: residuals(ols_model)
## weights: state_sw
##
## Moran I statistic standard deviate = 2.3168, p-value = 0.01026
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.202895259 -0.020833333 0.009324998
If significant, residuals are spatially autocorrelated — OLS is misspecified.
# SAR (spatial lag model)
sar_model <- lagsarlm(Median_RawPredictedReadmissionRate ~ .,
data = regData, listw = state_sw)
# SEM (spatial error model)
sem_model <- errorsarlm(Median_RawPredictedReadmissionRate ~ .,
data = regData, listw = state_sw)
summary(sar_model)
##
## Call:
## lagsarlm(formula = Median_RawPredictedReadmissionRate ~ ., data = regData,
## listw = state_sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.824130 -0.322515 -0.043459 0.274045 1.223262
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate
## (Intercept) 12.484591
## MRSA.Bacteremia -0.719226
## Abdomen.CT.Use.of.Contrast.Material -0.283273
## Death.rate.for.pneumonia.patients -0.017269
## Postoperative.respiratory.failure.rate -1.175855
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 1.345399
## Composite.patient.safety -0.338488
## Medicare.spending.per.patient 0.932828
## Healthcare.workers.given.influenza.vaccination -0.077218
## Left.before.being.seen 0.438677
## Venous.Thromboembolism.Prophylaxis 0.353942
## Nurse.communication 2.573345
## Doctor.communication -0.426505
## Staff.responsiveness 0.266106
## Communication.about.medicines -1.017975
## Discharge.information -1.070604
## Care.transition 0.735726
## Cleanliness 0.130783
## Quietness -0.172738
## Overall.hospital.rating -0.745781
## Recommend.hospital -1.549243
## SurveyResponseRate -0.078888
## PaymentCategory.for.pneumonia.patients -0.091533
## clusterHotspots 0.411812
## clusterOther 0.456507
## Std. Error
## (Intercept) 2.080875
## MRSA.Bacteremia 0.468432
## Abdomen.CT.Use.of.Contrast.Material 0.278001
## Death.rate.for.pneumonia.patients 0.334968
## Postoperative.respiratory.failure.rate 0.546390
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.504204
## Composite.patient.safety 0.384396
## Medicare.spending.per.patient 0.262337
## Healthcare.workers.given.influenza.vaccination 0.202970
## Left.before.being.seen 0.281115
## Venous.Thromboembolism.Prophylaxis 0.387488
## Nurse.communication 0.569740
## Doctor.communication 0.411280
## Staff.responsiveness 0.369322
## Communication.about.medicines 0.544723
## Discharge.information 0.414237
## Care.transition 0.645604
## Cleanliness 0.323651
## Quietness 0.261989
## Overall.hospital.rating 0.692466
## Recommend.hospital 0.698795
## SurveyResponseRate 0.284825
## PaymentCategory.for.pneumonia.patients 0.273542
## clusterHotspots 0.632985
## clusterOther 0.687756
## z value Pr(>|z|)
## (Intercept) 5.9997 1.977e-09
## MRSA.Bacteremia -1.5354 0.1246877
## Abdomen.CT.Use.of.Contrast.Material -1.0190 0.3082207
## Death.rate.for.pneumonia.patients -0.0516 0.9588843
## Postoperative.respiratory.failure.rate -2.1520 0.0313939
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 2.6684 0.0076221
## Composite.patient.safety -0.8806 0.3785500
## Medicare.spending.per.patient 3.5558 0.0003768
## Healthcare.workers.given.influenza.vaccination -0.3804 0.7036193
## Left.before.being.seen 1.5605 0.1186447
## Venous.Thromboembolism.Prophylaxis 0.9134 0.3610176
## Nurse.communication 4.5167 6.281e-06
## Doctor.communication -1.0370 0.2997273
## Staff.responsiveness 0.7205 0.4712023
## Communication.about.medicines -1.8688 0.0616514
## Discharge.information -2.5845 0.0097515
## Care.transition 1.1396 0.2544558
## Cleanliness 0.4041 0.6861497
## Quietness -0.6593 0.5096817
## Overall.hospital.rating -1.0770 0.2814833
## Recommend.hospital -2.2170 0.0266217
## SurveyResponseRate -0.2770 0.7818029
## PaymentCategory.for.pneumonia.patients -0.3346 0.7379097
## clusterHotspots 0.6506 0.5153135
## clusterOther 0.6638 0.5068415
##
## Rho: 0.19982, LR test value: 2.0124, p-value: 0.15602
## Asymptotic standard error: 0.12232
## z-value: 1.6336, p-value: 0.10235
## Wald statistic: 2.6686, p-value: 0.10235
##
## Log likelihood: -30.49428 for lag model
## ML residual variance (sigma squared): 0.2013, (sigma: 0.44867)
## Number of observations: 49
## Number of parameters estimated: 27
## AIC: 114.99, (AIC for lm: 115)
## LM test for residual autocorrelation
## test value: 4.5452, p-value: 0.033011
You’re reading the output of a Spatial Lag Model (SAR) and trying to assess whether it’s warranted compared to a standard OLS regression.
Here’s how to interpret this step-by-step, followed by what to do next:
Interpretation of SAR Output

Spatial Autoregressive Coefficient (Rho). Rho: 0.19982 is the estimated spatial lag coefficient, with an asymptotic z-value of 1.6336 (p = 0.10235). Not statistically significant at the 0.05 level. This suggests that the spatial dependence in the outcome variable is not strong enough to justify including the lag term in this model.

Likelihood Ratio Test. LR test value: 2.0124, p-value: 0.15602. This compares the SAR model to OLS. Not significant → SAR doesn’t significantly improve the fit over OLS.

Wald Test for Rho. Wald statistic: 2.6686, p-value: 0.10235. Also supports that Rho is not significantly different from 0.

Log Likelihood and AIC. SAR: AIC = 114.99; OLS: AIC = 115. Lower AIC is better, but here the two models are essentially tied, so AIC provides no meaningful preference either way.

LM Test for Residual Spatial Autocorrelation. Test value: 4.5452, p = 0.033. This test asks whether the residuals still have spatial autocorrelation. It is significant at the 0.05 level, so some spatial structure remains in the residuals even after fitting the lag model.

Summary

| Test | Result | Interpretation |
|---|---|---|
| Rho | Not significant | No strong spatial dependence in Y |
| LR test | Not significant | SAR not better than OLS |
| AIC | Essentially tied | No preference over OLS |
| LM test on residuals | p = 0.033 | Evidence of residual spatial autocorrelation |

Conclusion: Stick with OLS or Elastic Net unless there is strong domain justification for including spatial lags; SAR does not significantly improve model performance in your case. The significant LM test on the residuals, however, is a reason to consider a spatial error specification.
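As a quick sanity check on these numbers, the LR p-value can be reproduced in base R from the printed statistic (values copied from the output above; df = 1 because the lag model adds only rho):

```r
## Reproduce the LR test p-value from the printed statistic (df = 1 for rho).
ll_lag  <- -30.49428                  # log likelihood of the lag model (printed above)
lr_stat <- 2.0124                     # printed LR test value
ll_ols  <- ll_lag - lr_stat / 2       # implied log likelihood of the OLS fit
p_val   <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
round(p_val, 5)                       # 0.15602, matching the printed p-value
```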
What to Do Next

Use OLS or Elastic Net (if your goal is prediction). If you’re concerned about spatial autocorrelation, consider a Spatial Error Model (SEM), which accounts for spatial autocorrelation in the residuals, especially given the significant LM test on the residuals above:

sem_model <- errorsarlm(target ~ ., data = train_data, listw = train_listw)
summary(sem_model)

For Elastic Net with spatial awareness, add spatially lagged predictors to your model matrix. For each predictor X1, compute its spatial lag, then include both X1 and lag_X1 in the Elastic Net model:

train_data$lag_X1 <- lag.listw(train_listw, train_data$X1)

Let me know your priority, inference or prediction, and I’ll tailor the next steps!
Great — since your priority is prediction, let’s walk through how to do spatially-informed predictive modeling using OLS and Elastic Net with spatial lags, assuming you already have:
- A fully preprocessed and normalized training set (train_data)
- A test set (test_data)
- A listw object (e.g., train_listw) representing spatial relationships between training observations

Step 1: Regular OLS (Baseline)
Start with a standard OLS for benchmarking:
ols_model <- lm(target ~ ., data = train_data)
pred_ols <- predict(ols_model, newdata = test_data)

Evaluate with RMSE, MAE, or R².
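If you want those metrics as reusable functions, here is a minimal base-R sketch (the helper names rmse, mae, and r2 are made up here, not from a package):

```r
## Minimal hold-out evaluation helpers in base R.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))
r2   <- function(pred, obs) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

rmse(c(1, 2, 3), c(1, 2, 5))   # sqrt(4/3), about 1.155
```

Once predictions exist, call e.g. rmse(pred_ols, test_data$target).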
Step 2: Add Spatial Lagged Predictors
Use lag.listw() to create spatial lags of predictors (not the outcome), and add them to your feature matrix:
library(spdep)
predictors <- names(train_data)[!names(train_data) %in% c("target")]
for (var in predictors) {
  train_data[[paste0("lag_", var)]] <- lag.listw(train_listw, train_data[[var]])
}

To do this for the test set, you must compute spatial weights linking test observations to the training ones, which is more complex. If your spatial unit is at the state or county level and you have a complete map of spatial neighbors, you may precompute test lags manually or use centroids and distance-based neighbors. But for simplicity, assume spatial lags are only available for the training set for now.
Step 3: Elastic Net with Cross-Validation
Use the glmnet package on the extended feature matrix (including lags):
library(glmnet)
X_train <- model.matrix(target ~ . - 1, data = train_data)  # drop the intercept here
y_train <- train_data$target

set.seed(123)
cv_model <- cv.glmnet(X_train, y_train, alpha = 0.5)  # alpha = 0.5 → Elastic Net
plot(cv_model)
best_lambda <- cv_model$lambda.min
final_model <- glmnet(X_train, y_train, alpha = 0.5, lambda = best_lambda)

To make predictions (note that model.matrix() only lines up with X_train if the test set has the same columns and factor levels):

X_test <- model.matrix(target ~ . - 1, data = test_data)
pred_elastic <- predict(final_model, newx = X_test)

Optional: Spatial Lags for Test Set
To do this properly, you’d need to:
- Merge the test and training sets temporarily
- Create a listw object over all observations
- Recompute spatial lags for all points
- Then separate the sets again

Let me know if you’d like help implementing this part.
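Those steps could be sketched with spdep as follows (a toy, self-contained version: lon, lat, and X1 are placeholder names, and k = 5 nearest neighbors stand in for a real neighborhood definition). One caveat: pooling the sets this way lets test points act as each other's neighbors, which is why the training-neighbors-only approach is stricter:

```r
## Sketch: pool train and test, build one kNN listw, compute lags, split back.
library(spdep)

set.seed(1)  # toy data with placeholder columns lon, lat, X1
train_data <- data.frame(lon = runif(20), lat = runif(20), X1 = rnorm(20))
test_data  <- data.frame(lon = runif(5),  lat = runif(5),  X1 = rnorm(5))

all_data  <- rbind(train_data, test_data)
coords    <- as.matrix(all_data[, c("lon", "lat")])
all_listw <- nb2listw(knn2nb(knearneigh(coords, k = 5)), style = "W")

all_data$lag_X1 <- lag.listw(all_listw, all_data$X1)

n_train      <- nrow(train_data)
train_lagged <- all_data[seq_len(n_train), ]   # rows back to the training set
test_lagged  <- all_data[-seq_len(n_train), ]  # rows back to the test set
```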
Evaluate Predictions
sqrt(mean((pred_elastic - test_data$target)^2))
cor(pred_elastic, test_data$target)^2

Summary

| Method | Spatially Informed? | Cross-Validated? | Notes |
|---|---|---|---|
| OLS | No | No | Benchmark |
| Elastic Net | Yes (if lags used) | Yes | Best for regularization + spatial predictors |
| SAR / SEM | Yes (model-based) | No | Not ideal for prediction |

Let me know if you’d like:
- Help engineering test-set spatial lags
- Adding distance-weighted predictors instead of adjacency
- Geographically Weighted Regression (GWR) approaches for local fitting

You’re very close: now it’s about testing predictive performance with and without spatial lags!
You want each test observation to include the average of neighboring training observations for some predictors — i.e., cross-set spatial lags.
Step-by-Step: Spatial Lag for Test Set Using Distance-Based Weights

1. Create a Combined Set of Coordinates
library(sf)
library(spdep)

train_coords <- st_as_sf(train_data, coords = c("lon", "lat"), crs = 4326)
test_coords <- st_as_sf(test_data, coords = c("lon", "lat"), crs = 4326)

2. Build Cross-Weights: Each Test Point → Nearest Train Points
You could use spdep::dnearneigh() with a distance threshold (e.g., 500 km), but the simplest cross-set option is k-nearest neighbors via FNN::get.knnx(), which returns, for each test point, the indices of its k nearest training points:

train_mat <- st_coordinates(train_coords)
test_mat <- st_coordinates(test_coords)

library(FNN)
knn_res <- get.knnx(train_mat, test_mat, k = 5)  # knn_res$nn.index is n_test x 5

3. Compute Lagged Variables from the Training Set

We want to apply spatial lags to test observations using the training data values. Note that lag.listw() expects the weights and the data to describe the same set of observations, so for cross-set lags we average over the neighbor index matrix directly:

train_values <- train_data$x1
lag_x1_test <- apply(knn_res$nn.index, 1, function(idx) mean(train_values[idx]))
test_data$lag_x1 <- lag_x1_test

Repeat this for all variables you want to spatially lag:

vars_to_lag <- c("x1", "x2", "x3")  # your actual variable names
for (var in vars_to_lag) {
  train_values <- train_data[[var]]
  test_data[[paste0("lag_", var)]] <- apply(knn_res$nn.index, 1,
                                            function(idx) mean(train_values[idx]))
}

Final Use

Now your test_data has spatially lagged versions of x1, x2, …, based on training neighbors. You can:
- Build a model on the training data (including lags)
- Predict on the test data using those computed test lags

Notes

This approach assumes test observations don’t influence each other: only training neighbors are used to define the lags. For prediction tasks, this is a principled way to inject spatial structure without leaking information from the test set into training. Let me know if you’d prefer inverse-distance weighting, kNN + distance decay, or help adapting this to county/state shapefiles with contiguity.
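As a tiny self-contained illustration of the "average your training neighbors" idea (toy numbers, k = 2):

```r
## Toy cross-set lag: each test point averages its k nearest training values.
train_x  <- c(10, 20, 30, 40)            # training values of one predictor
nn_index <- rbind(c(1, 2), c(3, 4))      # nearest-2 training rows per test point
lag_test <- apply(nn_index, 1, function(idx) mean(train_x[idx]))
lag_test                                 # 15 35
```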
counties <- map_data("county")
ggplot(counties, aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black",
fill = skittles[1],
alpha = 0.5) +
theme(
panel.grid = element_blank(),
panel.background = element_rect(fill = "white"),
plot.background = element_rect(fill = "white"),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank()
) +
ggtitle("A beautiful, blank map of the continental United States")
zipcodeR is an incredibly handy package when working
with US data! The search_state() function looks up
ZIP-code-level details for the states in our data, including city- and
county-level information, which we can then merge with our county-level data.
## Apply the search_state function to get all sorts of wonderful zipcode level details
stateInfo <- search_state(pneumoniaFull$State)
Take a look at Tallapoose County, AL. Is this going to create a merging issue? YES
stateInfo <- stateInfo %>%
drop_na()
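Mismatches like this are easy to catch before merging with dplyr's anti_join(), which returns the rows that have no match. A toy sketch (stateInfo_demo and counties_demo are made-up stand-ins for the real tables):

```r
## Toy demo: spot county names that will fail to merge.
library(dplyr)

stateInfo_demo <- data.frame(county    = c("Autauga County", "Tallapoose County"))
counties_demo  <- data.frame(subregion = c("autauga", "tallapoosa"))

stateInfo_demo %>%
  mutate(subregion = tolower(gsub(" County", "", county))) %>%
  anti_join(counties_demo, by = "subregion") %>%
  pull(subregion)                          # "tallapoose" has no match
```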
Summarize the county-level data to calculate the average pneumonia-related
hospital readmissions (PredictedReadmissionsRate) so that you can merge this
with the stateInfo table above as well as the counties
dataframe that we extracted from map_data().
Hint 1: My code above should be an excellent starting point if you need it.
Hint 2: You may find it easiest at this point to go
ahead and rename your column CountyParish to match the
counties data frame! Also make sure to make it lower
case!
Hint 3: You are still going to need the state
information from the readmissions dataset! You will want to group by
both region AND subregion!
Hint 4: You will want to remove the NA
data before merging, but I would suggest doing it right before you group
and summarize.
## Your code here
pneumoniaFull <- pneumoniaFull %>%
mutate(subregion = tolower(gsub(" County| Parish", "", CountyParish)))
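A quick spot-check of the suffix cleaning on made-up examples (Louisiana records parishes rather than counties, so stripping both suffixes may be needed depending on how CountyParish is recorded):

```r
## Spot-check the suffix cleaning on toy names.
tolower(gsub(" County| Parish", "", c("Tallapoosa County", "Acadia Parish")))
# "tallapoosa" "acadia"
```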
countyAggPneumonia <- pneumoniaFull %>%
## Grab readmissions and region
select(PredictedReadmissionRate, subregion, region,
`Score_Death rate for pneumonia patients`,
`Score_Hospital return days for pneumonia patients`,
`Score_Medicare spending per patient`) %>%
## Change subregion to a factor
mutate(subregion = as.factor(subregion),
region = as.factor(region)) %>%
## Change any other columns from character to numeric
mutate_if(is.character, as.numeric) %>%
## Group by subregion
group_by(subregion, region) %>%
summarize(meanReadmissions = mean(PredictedReadmissionRate, na.rm = TRUE),
medianReadmissions = median(PredictedReadmissionRate, na.rm = TRUE),
meanDeathRate = mean(`Score_Death rate for pneumonia patients`, na.rm = TRUE),
medianDeathRate = median(`Score_Death rate for pneumonia patients`, na.rm = TRUE),
meanHospitalReturnDays = mean(`Score_Hospital return days for pneumonia patients`, na.rm = TRUE),
medianHospitalReturnDays = median(`Score_Hospital return days for pneumonia patients`, na.rm = TRUE),
meanMedicareSpending = mean(`Score_Medicare spending per patient`, na.rm = TRUE),
medianMedicareSpending = median(`Score_Medicare spending per patient`, na.rm = TRUE)) %>%
filter(!is.na(subregion), !is.na(region)) ## Puerto Rico, Guam, and other outlying territories missing
# save(countyAggPneumonia, file = "countyAggPneumonia.Rdata")
When data are missing, let’s set them to zero. But what does it mean that the
missing data include our target, the predicted readmissions rate?
countyAggPneumonia <- countyAggPneumonia %>%
mutate(medianReadmissions = replace_na(medianReadmissions, 0),
medianDeathRate = replace_na(medianDeathRate, 0))
# Then do the left join so no counties are lost
countyAggPneumonia <- counties %>%
left_join(countyAggPneumonia, by = c("region", "subregion")) %>%
arrange(group, order)
countyAggPneumonia %>%
# filter(region == "new jersey" | region == "new york") %>%
ggplot(aes(x = long, y = lat, group = group, fill = medianReadmissions)) +
geom_polygon(color = "black",
alpha = 1) +
scale_fill_continuous(name = "Median predicted % readmissions within 30 days",
low = "white", high = skittles[1],
na.value = "gray70" # Show missing counties in gray
) +
theme(
panel.grid = element_blank(),
panel.background = element_rect(fill = "white"),
plot.background = element_rect(fill = "white"),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.position = "bottom"
) +
labs(title = "Median pneumonia-related hospital readmissions rate by county",
subtitle = "Source: Centers for Medicare & Medicaid Services (CMS), 2024")
Merge the counties data with the summarized readmissions
dataset you just made so that PredictedReadmissionsRate can be mapped.
Note: I got you started here because there’s a bit
of trickery. You must arrange by the order
(which tells the polygons what order to plot in). Merging will jumble
them up and plot a big mess if you don’t order them!
## FILL IN THE BLANKS
# ___ <- merge(counties, ___, by = c("region", "subregion"), all.x = TRUE) %>% arrange(order)
# Your code here
What do you notice immediately about the county-level data? Why do you think this is an issue? Are there some counties that legitimately lack hospitals entirely?
Your answer here.
Use the stateInfo dataframe to explore a
hypothesis of your own about county-level hospital
admissions. You could look at population density again, median income,
median home value, or even the number of housing units!
Hint: You would need to merge the data again with
the counties dataset so that you could make a map.
# Your code here